Group 4: Zhaohui (Emily) Fan, Amber Moon, Hansol Kim
Kaggle Datasource: https://www.kaggle.com/rikdifos/credit-card-approval-prediction
Other Context: https://mp.weixin.qq.com/s/upjzuPg5AMIDsGxlpqnoCg
Credit score cards are a common risk-control method in the financial industry. The data contains personal information submitted by credit card applicants, which we can use to predict whether the bank should issue a credit card to an applicant. Two data files are used: the application data and the credit data, which includes monthly credit card account status information.
Explain clearly the business problem you are trying to solve using machine learning and data mining. Remember, you need to approach this as if you were presenting to your boss, a CEO, or a Board of Directors. (Don't be afraid to show your technical and communication skills. You need to select relevant business questions to answer.)
# import libraries and data
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
import itertools
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix, precision_score, recall_score  # note: plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay.from_estimator on newer versions
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
applications_df = pd.read_csv('application_record.csv')
credit_df = pd.read_csv('credit_record.csv')
# plt.rcParams['figure.facecolor'] = 'white'
Primary business question: which applicant features indicate a potentially risky candidate to extend credit to?
For the purposes of this assignment, we assume that a unique ID equates to a unique applicant to be included in modeling. Therefore, data shaping is based on the unique IDs that appear in each dataset.
# For Applications dataset:
# View shape, nulls, cardinality, duplication (unique count of ID)
applications_df.shape
applications_df.isnull().sum()/applications_df.shape[0]
# For Credit Records dataset:
# View shape, nulls, cardinality, duplication (unique count of ID)
credit_df.shape
credit_df.isnull().sum()
print("Unique IDs in credit records dataset:")
print(credit_df['ID'].nunique())
print('')
print("Unique IDs in applications dataset:")
print(applications_df['ID'].nunique())
Addressing Null Values
# 30% of Occupation Type column is null
applications_df['OCCUPATION_TYPE'].value_counts().sort_values().plot(kind='barh', figsize=(7,5))
Because Occupation Type is a categorical variable, some possible options for addressing nulls are:
1) replace nulls with the mode
2) drop the observations with nulls
3) create a new category within the variable
4) predict the missing values from the other features
However, since our dataset includes a comparable variable called Income Type, we will drop Occupation Type altogether.
applications_df.drop('OCCUPATION_TYPE', axis=1, inplace=True)
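Had we kept the column, options (1) and (3) above could be sketched as follows. This is a minimal illustration on a hypothetical mini-series standing in for OCCUPATION_TYPE, not part of our pipeline:

```python
import pandas as pd

# Hypothetical stand-in for the OCCUPATION_TYPE column
s = pd.Series(['Laborers', None, 'Managers', 'Laborers'])

# Option 1: replace nulls with the mode (most frequent category)
filled_mode = s.fillna(s.mode()[0])

# Option 3: create a new category for missing values
filled_new = s.fillna('Unknown')
```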
We also see above that the Credit Records dataset has a long structure with 45,985 unique values in 1,048,575 rows, while the Applications dataset has minimal duplication.
Therefore, to merge the two into one de-duplicated dataset with our target prediction variable we will:
1) Identify how many IDs are common to both datasets (36,457)
2) Reshape Credit Records by Grouping by ID
3) Identify IDs where at any point in time there was a balance 60+ days overdue
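For context, the STATUS codes in credit_record.csv (per the Kaggle data dictionary, which we assume is accurate) can be summarized as a lookup table; flagging statuses '2' through '5' is what captures "60+ days overdue at some point":

```python
# STATUS code meanings from the Kaggle data dictionary (assumed accurate)
STATUS_MEANING = {
    'X': 'no loan for the month',
    'C': 'paid off that month',
    '0': '1-29 days past due',
    '1': '30-59 days past due',
    '2': '60-89 days past due',
    '3': '90-119 days past due',
    '4': '120-149 days past due',
    '5': '150+ days overdue / write-off',
}
RISKY_STATUSES = {'2', '3', '4', '5'}  # 60+ days overdue
```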
len(set(applications_df['ID']).intersection(set(credit_df['ID'])))
id_index = list(set(applications_df['ID']).intersection(set(credit_df['ID'])))
applications_df = applications_df[applications_df['ID'].isin(id_index)]
print(applications_df.shape)
# Flag months where the account was 60+ days overdue (STATUS 2-5)
credit_df['dep_value'] = None
credit_df.loc[credit_df['STATUS'].isin(['2', '3', '4', '5']), 'dep_value'] = 'Yes'
# Count flagged months per ID; any non-zero count means the ID was risky at some point
resp = credit_df.groupby('ID').count()
resp['dep_value'] = np.where(resp['dep_value'] > 0, 'Yes', 'No')
resp = resp[['dep_value']]
df=pd.merge(applications_df,resp,how='inner',on='ID')
df['target']=df['dep_value']
df.loc[df['target']=='Yes','target']=1
df.loc[df['target']=='No','target']=0
df.head().T
#rename columns
df.rename(columns={'CODE_GENDER':'Gender','FLAG_OWN_CAR':'Car','FLAG_OWN_REALTY':'Reality','CNT_CHILDREN':'Children','AMT_INCOME_TOTAL':'income',
'NAME_EDUCATION_TYPE':'education','NAME_HOUSING_TYPE':'housing','FLAG_EMAIL':'email','NAME_INCOME_TYPE':'income_type','CNT_FAM_MEMBERS':'family_size',
},inplace=True)
# How do income and education relate to credit status?
plt.figure(figsize=(5,5))
sns.barplot(x="income", y="education", hue="dep_value", data=df)
plt.figure(figsize=(5,5))
sns.stripplot(x="dep_value", y="income", data=df, jitter=True)
# catplot is figure-level and creates its own figure, so size it via its own arguments
sns.catplot(x="dep_value", y="income", hue="Gender", kind="box", data=df, height=5)
# add column for age in years
df['age'] = df['DAYS_BIRTH']/-365
# add column for years in workforce
df['years_working'] = df['DAYS_EMPLOYED']/-365
# drop columns we aren't going to use
df = df.drop(columns={'ID', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'email', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'NAME_FAMILY_STATUS'})
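One caveat on the years_working conversion above: in this Kaggle dataset, DAYS_EMPLOYED reportedly uses the positive sentinel value 365243 for pensioners, which the division by -365 turns into a large negative value. A minimal sketch of masking that sentinel first (on hypothetical sample values) would be:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two employed applicants and one pensioner sentinel
days_employed = pd.Series([-2000, -500, 365243])

# Replace the sentinel with NaN before converting to positive years
years_working = days_employed.replace(365243, np.nan) / -365
```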
# create subset risky_df to look for relationships within only "high risk"
# "high risk" = at any point in time had status 60 days or more overdue
risky = [1]  # target is stored as the integer 1/0, so match on the int, not the string '1'
risky_df = df.loc[df['target'].isin(risky)]
# Within risky subset, where can relationships be identified?
# pairplot is figure-level and creates its own figure
sns.pairplot(risky_df)
# Within risky subset, what are the relationships between income, age, gender and credit status?
# lmplot is figure-level and creates its own figure
sns.lmplot(x="income", y="age", hue="Gender", data=risky_df, height=10)
# Income by gender of applicants in dataset
sns.catplot(data=df.sort_values("income"), orient="h", kind="box", x="income", y="Gender", height=3, aspect=2)
# Family Size by gender of applicants in dataset
sns.catplot(data=df.sort_values("family_size"), orient="h", kind="box", x="family_size", y="Gender", height=3, aspect=2)
Binary Data
# replace M/F with 0/1
df['Gender'] = df['Gender'].replace(['F','M'],[1,0])
print(df['Gender'].value_counts())
# replace car flag with 0/1
df['Car'] = df['Car'].replace(['N','Y'],[0,1])
print(df['Car'].value_counts())
# replace realty with 0/1
df['Reality'] = df['Reality'].replace(['N','Y'],[0,1])
print(df['Reality'].value_counts())
Categorical Data
# Convert continuous income to category
df['income'] = pd.cut(df['income'], bins=3, labels=["low", "medium", "high"])
# Convert continuous Children to category
df.loc[df['Children'] >= 2,'Children']='2orMore'
# Convert continuous Family Size to category
df.loc[df['family_size'] >= 3,'family_size']='3orMore'
# Convert continuous Age to category
df['age'] = pd.cut(df['age'], bins=3, labels=["young", "middle_age", "older"])
# Convert continuous Years Working to category
df['years_working'] = pd.cut(df['years_working'], bins=3, labels=["entry", "mid_career", "seasoned"])
df.head(2)
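A note on the binning above: income is typically right-skewed, and pd.cut with equal-width bins can put nearly all applicants in "low" if a few high earners stretch the bin edges. pd.qcut (quantile bins) is an alternative worth considering. A sketch on hypothetical incomes:

```python
import pandas as pd

# Hypothetical right-skewed incomes (one large outlier)
income = pd.Series([30_000, 40_000, 45_000, 50_000, 60_000, 1_000_000])

# Equal-width bins: the outlier stretches the edges, crowding one bin
width_bins = pd.cut(income, bins=3, labels=["low", "medium", "high"])

# Quantile bins: roughly equal numbers of applicants per bin
quantile_bins = pd.qcut(income, q=3, labels=["low", "medium", "high"])
```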
One Hot Encoding
# convert all categories to binary
df = pd.get_dummies(df, columns=['income_type',
'education',
'housing',
'income',
'Children',
'family_size',
'age',
'years_working'])
df.head(2)
# rename columns
df = df.drop(columns={'dep_value'})
df.columns = ['Gender','Car','Reality','target','income_type_Commercial','income_type_Pensioner','income_type_State',
'income_type_Student','income_type_Working','education_degree','education_Higher_ed',
'education_Incomplete_higher','education_Lower_secondary','education_Secondary','housing_apartment',
'housing_house','housing_municipal','housing_office_apartment','housing_rented_apartment','housing_parents',
'income_low','income_medium','income_high','Children_0','Children_1','Children_2orMore','family_size_1',
'family_size_2','family_size_3orMore','age_young','age_middle','age_older','years_working_entry',
'years_working_mid_career','years_working_seasoned']
# cast all one-hot columns (everything except Gender, Car, Reality, target) to int64
dummy_cols = [c for c in df.columns if c not in ('Gender', 'Car', 'Reality', 'target')]
df[dummy_cols] = df[dummy_cols].astype('int64')
Split data and fix imbalance using SMOTE
Instructor Comments
# columns of data we will use to make classifications
X = df.drop('target', axis=1).copy()
# what we want to predict
y = df['target'].copy()
# Checking if the data is imbalanced or not
sum(y)/ len(y)
# 1.69% of the applicants fall in the risky target class. We need to maintain the same % across
# training and testing datasets. That is "stratification": split the data so each set keeps the same class ratio.
# Using the Synthetic Minority Over-sampling Technique (SMOTE) to address the class imbalance.
# Caveat: oversampling before the train/test split lets synthetic points derived from test-set
# neighbors leak into training; applying SMOTE to the training fold only is generally safer.
X_balance, Y_balance = SMOTE().fit_resample(X, y)  # fit_sample was renamed fit_resample in imblearn 0.4+
X_balance = pd.DataFrame(X_balance, columns = X.columns)
X_train, X_test, y_train, y_test = train_test_split(X_balance,Y_balance,
stratify=Y_balance, test_size=0.3,
random_state = 1024)
logistic_model = LogisticRegression(C=0.8,
random_state=0,
solver='lbfgs')
logistic_model.fit(X_train, y_train)
y_predict = logistic_model.predict(X_test)
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
# precision = (TP) / (TP+FP)
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
# recall = (TP) / (TP+FN)
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
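To make the precision and recall formulas in the comments above concrete, here is a tiny worked example on toy labels (unrelated to our dataset):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 3 actual positives, 5 actual negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]  # 2 TP, 1 FN, 2 FP, 3 TN

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)  # 2 / 4 = 0.5
recall = tp / (tp + fn)     # 2 / 3 ~= 0.667
```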
%%time
clf_dt = DecisionTreeClassifier(random_state=1024)
clf_dt = clf_dt.fit(X_train, y_train)
y_predict = clf_dt.predict(X_test)
plot_confusion_matrix(clf_dt, X_test, y_test, display_labels=["No Balance > 60 Days", "Had Balance > 60 Days"])
path = clf_dt.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
ccp_alphas = ccp_alphas[:-1]  # drop the largest alpha, which prunes the tree down to a single node
clf_dts = []
for ccp_alpha in ccp_alphas:
    clf_dt = DecisionTreeClassifier(random_state=1024, ccp_alpha=ccp_alpha)
    clf_dt.fit(X_train, y_train)
    clf_dts.append(clf_dt)
# plot accuracy
train_scores = [clf_dt.score(X_train, y_train) for clf_dt in clf_dts]
test_scores = [clf_dt.score(X_test, y_test) for clf_dt in clf_dts]
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test", drawstyle="steps-post")
ax.legend()
plt.show()
%%time
# use cross validation to find best alpha
alpha_loop_values = []
for ccp_alpha in ccp_alphas:
    clf_dt = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    scores = cross_val_score(clf_dt, X_train, y_train, cv=5)
    alpha_loop_values.append([ccp_alpha, np.mean(scores), np.std(scores)])
alpha_results = pd.DataFrame(alpha_loop_values,
columns=['alpha', 'mean_accuracy', 'std'])
alpha_results.plot(x='alpha',
y='mean_accuracy',
yerr='std',
marker='o',
linestyle='--')
# Find exact value
alpha_results[(alpha_results['alpha'] > 0.000)
&
(alpha_results['alpha'] < 0.001)]
# Store ideal value for alpha
ideal_ccp_alpha = 0.001
%%time
clf_dt_pruned = DecisionTreeClassifier(random_state=1024,
ccp_alpha=ideal_ccp_alpha)
clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)
y_predict_pruned = clf_dt_pruned.predict(X_test)
print(pd.DataFrame(confusion_matrix(y_test,y_predict_pruned)))
# confusion matrix
plot_confusion_matrix(clf_dt_pruned,
X_test,
y_test,
display_labels=["No Balance > 60 Days", "Had Balance > 60 Days"])
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict_pruned)))
# precision = (TP) / (TP+FP)
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict_pruned)))
# recall = (TP) / (TP+FN)
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict_pruned)))
# Draw Decision Tree
plt.figure(figsize=(25,25))
plot_tree(clf_dt_pruned,
filled=True,
rounded=True,
class_names=["No Overdue", "Yes Overdue"],
feature_names=X.columns)
%%time
clf_rf = RandomForestClassifier(n_estimators=250,
max_depth=50,
min_samples_leaf=16
)
clf_rf.fit(X_train, y_train)
y_predict = clf_rf.predict(X_test)
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
# precision = (TP) / (TP+FP)
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
# recall = (TP) / (TP+FN)
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
clf_rf_matrix = pd.crosstab(y_test, y_predict, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(clf_rf_matrix, annot=True)
%%time
from lightgbm import LGBMClassifier
clf_gbm = LGBMClassifier(num_leaves=35,
max_depth=8,
learning_rate=0.02,
n_estimators=250,
subsample = 0.8,
colsample_bytree =0.8
)
clf_gbm.fit(X_train, y_train)
y_predict = clf_gbm.predict(X_test)
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
# precision = (TP) / (TP+FP)
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
# recall = (TP) / (TP+FN)
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
# Confusion Matrix on the test data
plot_confusion_matrix(clf_gbm,
X_test,
y_test,
values_format='d',
display_labels=["No Balance > 60 Days", "Had Balance > 60 Days"])
#Showing important features:
def plot_importance(classifier, x_train, point_size=25):
    # plot feature importances, sorted by score
    values = sorted(zip(x_train.columns, classifier.feature_importances_),
                    key=lambda x: -x[1])
    imp = pd.DataFrame(values, columns=["Name", "Score"])
    imp.sort_values(by='Score', inplace=True)
    b = sns.scatterplot(x='Score', y='Name', linewidth=0,
                        data=imp, s=point_size, color='red')
    fig = plt.gcf()
    fig.set_size_inches(7, 15)  # change plot size
    b.set_xlabel("importance", fontsize=15)
    b.set_ylabel("features", fontsize=15)
    b.tick_params(labelsize=10)
plot_importance(clf_gbm, X_train, 20)
%%time
import xgboost as xgb
# Creating the XGBClassfier shell
clf_xgb = xgb.XGBClassifier(objective='binary:logistic', missing=None, seed=42)
'''
Instead of finding the optimal number of trees with k-fold cross-validation,
we use early stopping: training stops once the evaluation metric on the
held-out eval_set no longer improves. We just have to specify the number
of rounds to allow with no improvement before stopping. The evaluation
metric is the area under the precision-recall curve (aucpr).
'''
clf_xgb.fit(X_train,
y_train,
verbose=True,
early_stopping_rounds=10,
eval_metric='aucpr',
eval_set=[(X_test, y_test)])
# After building 99 trees the model no longer improved, so training checked
# 10 more rounds with no improvement and then stopped (base_score=0.5 is the default prior).
# make predictions for test data
y_predict = clf_xgb.predict(X_test)
# evaluate predictions
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
# precision = (TP) / (TP+FP)
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
# recall = (TP) / (TP+FN)
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
# Confusion Matrix on the test data
plot_confusion_matrix(clf_xgb,
X_test,
y_test,
values_format='d',
display_labels=["No Balance > 60 Days", "Had Balance > 60 Days"])
xgb.plot_importance(clf_xgb)
plt.rcParams['figure.figsize'] = [10, 15]
plt.show()
## graphviz has an issue on my laptop's system and is unavailable for now.
## To avoid spending more time on it, this graph is left out of the project.
from xgboost import plot_tree
from graphviz import Digraph
# Draw XGBoost Tree
fig = plt.figure(figsize=(10, 10))
ax = fig.subplots()
xgb.plot_tree(clf_xgb, num_trees=20, ax=ax)
plt.show()
# CalledProcessError: Command '['dot', '-Tpng']' returned non-zero exit status 4294967295
1) "Ways To Handle Categorical Column Missing Data & Its Implementations," Analytics Vidhya (Medium). https://medium.com/analytics-vidhya/ways-to-handle-categorical-column-missing-data-its-implementations-15dc4a56893